Dynamic Predictive Modeling Platform

ABSTRACT

Methods, systems, and apparatus, including computer programs encoded on one or more computer storage devices, for training and retraining predictive models. A series of training data sets are received and added to a training data queue. In response to a first condition being satisfied, multiple retrained predictive models are generated using the training data queue, multiple updateable trained predictive models obtained from a repository of trained predictive models, and multiple training functions. In response to a second condition being satisfied, multiple new trained predictive models are generated using the training data queue, at least some training data stored in a training data repository and training functions. The new trained predictive models include static trained predictive models and updateable trained predictive models. The repository of trained predictive models is updated with at least some of the retrained predictive models and new trained predictive models.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. application Ser. No. 13/014,252, titled “Dynamic Predictive Modeling Platform,” filed Jan. 26, 2011, the entire contents of which are incorporated herein by reference.

TECHNICAL FIELD

This specification relates to training and retraining predictive models.

BACKGROUND

Predictive analytics generally refers to techniques for extracting information from data to build a model that can predict an output from a given input. Predicting an output can include predicting future trends or behavior patterns, or performing sentiment analysis, to name a few examples. Various types of predictive models can be used to analyze data and generate predictive outputs. Typically, a predictive model is trained with training data that includes input data and output data that mirror the form of input data that will be entered into the predictive model and the desired predictive output, respectively. The amount of training data that may be required to train a predictive model can be large, e.g., in the order of gigabytes or terabytes. The number of different types of predictive models available is extensive, and different models behave differently depending on the type of input data. Additionally, a particular type of predictive model can be made to behave differently, for example, by adjusting the hyper-parameters or via feature induction or selection.

SUMMARY

In general, in one aspect, the subject matter described in this specification can be embodied in a computer-implemented system that includes one or more computers and one or more data storage devices coupled to the one or more computers. The one or more storage devices store: a repository of training functions, a repository of trained predictive models (including static trained predictive models and updateable trained predictive models), a training data queue, a training data repository, and instructions that, when executed by the one or more computers, cause the one or more computers to perform operations. The operations include receiving a series of training data sets and adding the training data sets to the training data queue. In response to a first condition being satisfied, multiple retrained predictive models are generated using the training data queue, multiple updateable trained predictive models obtained from the repository of trained predictive models, and multiple training functions obtained from the repository of training functions. The repository of trained predictive models is updated by storing one or more of the generated retrained predictive models. In response to a second condition being satisfied, multiple new trained predictive models are generated using the training data queue and at least some of the training data stored in the training data repository and training functions obtained from the repository of training functions. The new trained predictive models include static trained predictive models and updateable trained predictive models. The repository of trained predictive models is updated by storing at least some of the new trained predictive models. Other embodiments of this aspect include corresponding methods and computer programs recorded on computer storage devices, each configured to perform the actions described above.

These and other embodiments can each optionally include one or more of the following features, alone or in combination. The series of training data sets can be received incrementally or together in a batch. The first condition can be satisfied when: a size of the training data queue is greater than or equal to a threshold size; a command is received to update the updateable trained predictive models included in the repository of trained predictive models; or a predetermined time period has expired. The second condition can be satisfied: in response to receiving a command to update the static models and the updateable models included in the repository of trained predictive models; after a predetermined time period has expired; or when a size of the training data queue is greater than or equal to a threshold size.
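The three alternative triggers listed above can be sketched as a single check. This is a minimal illustration only; the class name `RetrainTrigger` and its fields are hypothetical and do not appear in this specification.

```python
from dataclasses import dataclass


@dataclass
class RetrainTrigger:
    """Hypothetical holder for the settings that define a condition."""
    threshold_size: int    # queue size that triggers retraining
    period_seconds: float  # predetermined time period
    last_run: float        # time of the previous (re)training run

    def is_satisfied(self, queue_size, command_received, now):
        # Any one of the three alternatives is enough to satisfy the condition.
        return (
            queue_size >= self.threshold_size
            or command_received
            or now - self.last_run >= self.period_seconds
        )
```

The same shape could serve for either the first or the second condition, with different threshold and period values configured for each.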

The system can further include a user interface configured to receive user input specifying a data retention policy that defines rules for maintaining and deleting training data included in the training data repository.

The operations can further include generating updated training data that includes at least some of the training data from the training data queue and at least some of the training data from the training data repository, and updating the training data repository by storing the updated training data. Generating the updated training data can include implementing a data retention policy that defines rules for maintaining and deleting training data included in at least one of the training data queue or the training data repository. The data retention policy can include a rule for deleting training data from the training data repository when the training data repository size reaches a predetermined size limit.
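One possible size-limit rule is sketched below. The specification does not say which records are deleted when the limit is reached; dropping the oldest records first is an assumption made here for illustration, and the function name `update_repository` is hypothetical.

```python
def update_repository(repository, queue, max_records):
    """Merge queued records into the repository, then enforce a size limit.

    Assumption: records are ordered oldest-first and the oldest are deleted
    first. The retention policy in the text leaves this choice open.
    """
    merged = repository + queue
    overflow = len(merged) - max_records
    return merged[overflow:] if overflow > 0 else merged
```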

Updating the repository of trained predictive models by storing one or more of the generated retrained predictive models can include, for each of the retrained predictive models: comparing an effectiveness score of the retrained predictive model to an effectiveness score of the updateable trained predictive model from the predictive model repository that was used to generate the retrained predictive model; and based on the comparison, selecting a first of the two predictive models to store in the repository of predictive models and not storing a second of the two predictive models in the repository. The effectiveness score for a trained predictive model is a score that represents an estimation of the effectiveness of the trained predictive model.
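The comparison described above can be sketched as follows. Two assumptions are made here that the text does not state: a higher effectiveness score is better, and a tie keeps the model already in the repository. The function name is hypothetical.

```python
def choose_model_to_store(existing, retrained):
    """Return whichever (model, effectiveness_score) pair should be stored.

    Assumptions: higher score means more effective; ties keep `existing`.
    """
    return retrained if retrained[1] > existing[1] else existing
```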

In general, in another aspect, the subject matter described in this specification can be embodied in a computer-implemented method that includes receiving new training data and adding the new training data to a training data queue. Whether a size of the training data queue is greater than a threshold size is determined. When the training data queue size is greater than the threshold size, multiple stored trained predictive models and a stored training data set are retrieved. Each of the stored trained predictive models was generated using the training data set and a training function and is associated with a score that represents an estimation of the effectiveness of the predictive model. Multiple retrained predictive models are generated using the training data queue, the retrieved plurality of trained predictive models and training functions. A new score associated with each of the generated retrained predictive models is generated. At least some of the training data from the training data queue is added to the stored training data set. Other embodiments of this aspect include corresponding systems, apparatus, and computer programs recorded on computer storage devices, each configured to perform the actions of the methods.

In some implementations, the threshold can be a predetermined data size or a predetermined ratio of the training data queue size to the size of the stored training data set.
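The two threshold styles can be sketched in one helper. The parameter names `max_size` and `max_ratio` are hypothetical, and treating an empty stored data set as never exceeding a ratio threshold is an assumption made only to avoid dividing by zero.

```python
def queue_exceeds_threshold(queue_size, stored_size, max_size=None, max_ratio=None):
    """Check the queue against an absolute size or a queue-to-stored ratio."""
    if max_size is not None:
        return queue_size > max_size
    if max_ratio is not None and stored_size > 0:
        return queue_size / stored_size > max_ratio
    return False
```

For example, with a stored training data set of 1,000 records and a ratio threshold of 0.1, retraining would be triggered once the queue holds more than 100 records.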

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. A dynamic repository of trained predictive models can be maintained that includes updateable trained predictive models. The updateable trained predictive models can be dynamically updated as new training data becomes available. Static trained predictive models (i.e., predictive models that are not updateable) can be regenerated using an updated set of training data. A most effective trained predictive model can be selected from the dynamic repository and used to provide a predictive output in response to receiving input data. The most effective trained predictive model in the dynamic repository can change over time as new training data becomes available and is used to update the repository (i.e., to update and/or regenerate the trained predictive models). A service can be provided, e.g., “in the cloud”, where a client computing system can provide input data and a prediction request and receive in response a predictive output without expending client-side computing resources or requiring client-side expertise for predictive analytical modeling. The client computing system can incrementally provide new training data and be provided access to the most effective trained predictive model available at a given time, based on the training data provided by the client computing system as of that given time. An updateable trained predictive model that gives an erroneous predictive output can be easily and quickly corrected, for example, by providing the correct output as an update training sample upon detecting the error in output.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic representation of a system that provides a predictive analytic platform.

FIG. 2 is a schematic block diagram showing a system for providing a predictive analytic platform over a network.

FIG. 3 is a flowchart showing an example process for using the predictive analytic platform from the perspective of the client computing system.

FIG. 4 is a flowchart showing an example process for serving a client computing system using the predictive analytic platform.

FIG. 5 is a flowchart showing an example process for using the predictive analytic platform from the perspective of the client computing system.

FIG. 6 is a flowchart showing an example process for retraining updateable trained predictive models using the predictive analytic platform.

FIG. 7 is a flowchart showing an example process for generating a new set of trained predictive models using updated training data.

FIG. 8 is a flowchart showing an example process for maintaining an updated dynamic repository of trained predictive models.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

Methods and systems are described that provide a dynamic repository of trained predictive models, at least some of which can be updated as new training data becomes available. A trained predictive model from the dynamic repository can be provided and used to generate a predictive output for a given input. As a particular client entity's training data changes over time, the client entity can be provided access to a trained predictive model that has been trained with training data reflective of the changes. As such, the repository of trained predictive models from which a predictive model can be selected to use to generate a predictive output is “dynamic”, as compared to a repository of trained predictive models that are not updateable with new training data and are therefore “static”.

FIG. 1 is a schematic representation of a system that provides a predictive analytic platform. The system 100 includes multiple client computing systems 104 a-c that can communicate with a predictive modeling server system 109. In the example shown, the client computing systems 104 a-c can communicate with a server system front end 110 by way of a network 102. The network 102 can include one or more local area networks (LANs), a wide area network (WAN), such as the Internet, a wireless network, such as a cellular network, or a combination of all of the above. The server system front end 110 is in communication with, or is included within, one or more data centers, represented by the data center 112. A data center 112 generally is a large number of computers, housed in one or more buildings, that are typically capable of managing large volumes of data.

A client entity (for example, an individual, a group of people, or a company) may desire a trained predictive model that can receive input data from a client computing system 104 a belonging to or under the control of the client entity and generate a predictive output. Training a particular predictive model can require a significant volume of training data, for example, one or more gigabytes of data. The client computing system 104 a may be unable to efficiently manage such a large volume of data. Further, selecting and tuning an effective predictive model from the variety of available types of models can require skill and expertise that an operator of the client computing system 104 a may not possess.

The system 100 described here allows training data 106 a to be uploaded from the client computing system 104 a to the predictive modeling server system 109 over the network 102. The training data 106 a can include initial training data, which may be a relatively large volume of training data the client entity has accumulated, for example, if the client entity is a first-time user of the system 100. The training data 106 a can also include new training data that can be uploaded from the client computing system 104 a as additional training data becomes available. The client computing system 104 a may upload new training data whenever the new training data becomes available on an ad hoc basis, periodically in batches, in a batch once a certain volume has accumulated, or otherwise.

The server system front end 110 can receive, store and manage large volumes of data using the data center 112. One or more computers in the data center 112 can run software that uses the training data to estimate the effectiveness of multiple types of predictive models and make a selection of a trained predictive model to be used for data received from the particular client computing system 104 a. The selected model can be trained and the trained model made available to users who have access to the predictive modeling server system 109 and, optionally, permission from the client entity that provided the training data for the model. Access and permission can be controlled using any conventional techniques for user authorization and authentication and for access control, if restricting access to the model is desired. The client computing system 104 a can transmit prediction requests 108 a over the network. The selected trained model executing in the data center 112 receives the prediction request, input data and request for a predictive output, and generates the predictive output 114. The predictive output 114 can be provided to the client computing system 104 a, for example, over the network 102.

Advantageously, when handling large volumes of training data and/or input data, the processes can be scaled across multiple computers at the data center 112. The predictive modeling server system 109 can automatically provision and allocate the required resources, using one or more computers as required. An operator of the client computing system 104 a is not required to have any special skill or knowledge about predictive models. The training and selection of a predictive model can occur “in the cloud”, i.e., over the network 102, thereby lessening the burden on the client computing system's processor capabilities and data storage, and also reducing the required client-side human resources.

The term client computing system is used in this description to refer to one or more computers, which may be at one or more physical locations, that can access the predictive modeling server system. The data center 112 is capable of handling large volumes of data, e.g., on the scale of terabytes or larger, and as such can serve multiple client computing systems. For illustrative purposes, three client computing systems 104 a-c are shown; however, scores of client computing systems can be served by such a predictive modeling server system 109.

FIG. 2 is a schematic block diagram showing a system 200 for providing a dynamic predictive analytic platform over a network. For illustrative purposes, the system 200 is shown with one client computing system 202 communicating over a network 204 with a predictive modeling server system 206. However, it should be understood that the predictive modeling server system 206, which can be implemented using multiple computers that can be located in one or more physical locations, can serve multiple client computing systems. In the example shown, the predictive modeling server system includes an interface 208. In some implementations the interface 208 can be implemented as one or more modules adapted to interface with components included in the predictive modeling server system 206 and the network 204, for example, the training data queue 213, the training data repository 214, the model selection module 210 and/or the trained model repository 218.

FIG. 3 is a flowchart showing an example process 300 for using the predictive analytic platform from the perspective of the client computing system 202. The process 300 would be carried out by the client computing system 202 when the corresponding client entity was uploading the initial training data to the system 206. The client computing system 202 uploads training data (i.e., the initial training data) to the predictive modeling server system 206 over the network 204 (Step 302). In some implementations, the initial training data is uploaded in bulk (e.g., a batch) by the client computing system 202. In other implementations, the initial training data is uploaded incrementally by the client computing system 202 until a threshold volume of data has been received that together forms the “initial training data”. The size of the threshold volume can be set by the system 206, the client computing system 202 or otherwise determined. In response, the client computing system 202 receives access to a trained predictive model, for example, trained predictive model 218 (Step 304).

In the implementations shown, the trained predictive model 218 is not itself provided. The trained predictive model 218 resides and executes at a location remote from the client computing system 202. For example, referring back to FIG. 1, the trained predictive model 218 can reside and execute in the data center 112, thereby not using the resources of the client computing system 202. Once the client computing system 202 has access to the trained predictive model 218, the client computing system can send input data and a prediction request to the trained predictive model (Step 306). In response, the client computing system receives a predictive output generated by the trained predictive model from the input data (Step 308).

From the perspective of the client computing system 202, training and use of a predictive model is relatively simple. The training and selection of the predictive model, tuning of the hyper-parameters and features used by the model (to be described below) and execution of the trained predictive model to generate predictive outputs are all done remotely from the client computing system 202 without expending client computing system resources. The amount of training data provided can be relatively large, e.g., gigabytes or more, which is often an unwieldy volume of data for a client entity.

The predictive modeling server system 206 will now be described in more detail with reference to the flowchart shown in FIG. 4. FIG. 4 is a flowchart showing an example process 400 for serving a client computing system using the predictive analytic platform. The process 400 is carried out to provide the client computing system with access to a selected trained predictive model that has been trained using initial training data. Providing the client computing system with access to a predictive model that has been retrained using new training data (i.e., training data available after receiving the initial training data) is described below in reference to FIGS. 5 and 6.

Referring to FIG. 4, training data (i.e., initial training data) is received from the client computing system (Step 402). For example, the client computing system 202 can upload the training data to the predictive modeling server system 206 over the network 204 either incrementally or in bulk (i.e., as a batch). As described above, if the initial training data is uploaded incrementally, the training data can accumulate until a threshold volume is received before training of predictive models is initiated. The training data can be in any convenient form that is understood by the modeling server system 206 to define a set of records, where each record includes an input and a corresponding desired output. By way of example, the training data can be provided using a comma-separated value format, or a sparse vector format. In another example, the client computing system 202 can specify a protocol buffer definition and upload training data that complies with the specified definition.
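As an illustration of the comma-separated value option, the sketch below turns CSV text into records that pair an input with a desired output. The column layout is not defined in this specification; treating the last column of each row as the desired output is an assumption, and the function name is hypothetical.

```python
import csv
import io


def parse_training_records(csv_text):
    """Parse CSV text into (input_values, desired_output) records.

    Assumption: the final column of each row holds the desired output
    and all preceding columns form the input.
    """
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        *inputs, output = row
        records.append((inputs, output))
    return records
```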

The process 400 and system 200 can be used in various different applications. Some examples include (without limitation) making predictions relating to customer sentiment, transaction risk, species identification, message routing, diagnostics, churn prediction, legal docket classification, suspicious activity, work roster assignment, inappropriate content, product recommendation, political bias, uplift marketing, e-mail filtering and career counseling. For illustrative purposes, the process 400 and system 200 will be described using an example that is typical of how predictive analytics are often used. In this example, the client computing system 202 provides a web-based online shopping service. The training data includes multiple records, where each record provides the online shopping transaction history for a particular customer. The record for a customer includes the dates the customer made a purchase and identifies the item or items purchased on each date. The client computing system 202 is interested in predicting a next purchase of a customer based on the customer's online shopping transaction history.

Various techniques can be used to upload a training request and the training data from the client computing system 202 to the predictive modeling server system 206. In some implementations, the training data is uploaded using an HTTP web service. The client computing system 202 can access storage objects using a RESTful API to upload and to store their training data on the predictive modeling server system 206. In other implementations, the training data is uploaded using a hosted execution platform, e.g., AppEngine available from Google Inc. of Mountain View, Calif. The predictive modeling server system 206 can provide utility software that can be used by the client computing system 202 to upload the data. In some implementations, the predictive modeling server system 206 can be made accessible from many platforms, including platforms affiliated with the predictive modeling server system 206, e.g., for a system affiliated with Google, the platform could be a Google App Engine or Apps Script (e.g., from Google Spreadsheet), and platforms entirely independent of the predictive modeling server system 206, e.g., a desktop application. The training data can be large, e.g., many gigabytes. The predictive modeling server system 206 can include a data store, e.g., the training data repository 214, operable to store the received training data.

The predictive modeling server system 206 includes a repository of training functions for various predictive models, which in the example shown are included in the training function repository 216. At least some of the training functions included in the repository 216 can be used to train an “updateable” predictive model. An updateable predictive model refers to a trained predictive model that was trained using a first set of training data (e.g., initial training data) and that can be used together with a new set of training data and a training function to generate a “retrained” predictive model. The retrained predictive model is effectively the initial trained predictive model updated with the new training data. One or more of the training functions included in the repository 216 can be used to train “static” predictive models. A static predictive model refers to a predictive model that is trained with a batch of training data (e.g., initial training data) and is not updateable with incremental new training data. If new training data has become available, a new static predictive model can be trained using the batch of new training data, either alone or merged with an older set of training data (e.g., the initial training data) and an appropriate training function.
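To make the updateable case concrete, the following toy classifier keeps only counts as its parameters, so new records can be folded into an already trained model without revisiting the earlier data. It is a simplified count-based Bayes classifier written for illustration, not one of the training functions named in this specification, and the class name `CountingBayes` is hypothetical.

```python
import math
from collections import Counter, defaultdict


class CountingBayes:
    """Toy updateable model whose parameters are label and token counts."""

    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.vocabulary = set()

    def update(self, records):
        """Fold (tokens, label) records into the existing counts."""
        for tokens, label in records:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.label_counts.values())

        def log_score(label):
            counts = self.token_counts[label]
            # Add-one smoothing, with one extra slot reserved for unseen tokens.
            denom = sum(counts.values()) + len(self.vocabulary) + 1
            score = math.log(self.label_counts[label] / total)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            return score

        return max(self.label_counts, key=log_score)
```

Calling `update` first with the initial training data and later with new training data leaves the model with the same counts it would have had if trained once on the merged data, which is the property that makes retraining cheap. A static model, by contrast, would have to be regenerated from a batch.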

Some examples of training functions that can be used to train a static predictive model include (without limitation): regression (e.g., linear regression, logistic regression), classification and regression tree, multivariate adaptive regression spline and other machine learning training functions (e.g., Naïve Bayes, k-nearest neighbors, Support Vector Machines, Perceptron). Some examples of training functions that can be used to train an updateable predictive model include (without limitation) Online Bayes, Rewritten Winnow, Support Vector Machine (SVM) Analogue, Maximum Entropy (MaxEnt) Analogue, Gradient based (FOBOS) and AdaBoost with Mixed Norm Regularization. The training function repository 216 can include one or more of these example training functions.

Referring again to FIG. 4, multiple predictive models, which can be all or a subset of the available predictive models, are trained using some or all of the training data (Step 404). In the example predictive modeling server system 206, a model training module 212 is operable to train the multiple predictive models. The multiple predictive models include one or more updateable predictive models and can include one or more static predictive models.

The client computing system 202 can send a training request to the predictive modeling server system 206 to initiate the training of a model. For example, a GET or a POST request could be used to make a training request to a URL. A training function is applied to the training data to generate a set of parameters. These parameters form the trained predictive model. For example, to train (or estimate) a Naïve Bayes model, the method of maximum likelihood can be used. A given type of predictive model can have more than one training function. For example, if the type of predictive model is a linear regression model, more than one different training function for a linear regression model can be used with the same training data to generate more than one trained predictive model.

For a given training function, multiple different hyper-parameter configurations can be applied to the training function, again generating multiple different trained predictive models. Therefore, in the present example, where the type of predictive model is a linear regression model, changes to an L1 penalty generate different sets of parameters. Additionally, a predictive model can be trained with different features, again generating different trained models. The selection of features, i.e., feature induction, can occur during multiple iterations of computing the training function over the training data. For example, feature conjunction can be estimated in a forward stepwise fashion in a parallel distributed way enabled by the computing capacity of the predictive modeling server system, i.e., the data center.
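The effect of the L1 penalty can be seen in a deliberately small case: a linear model with a single slope and no intercept, for which the penalized least-squares solution has a closed form (soft thresholding). This is an illustrative sketch, not a training function from the repository 216; the function name and the sample data are made up.

```python
def fit_l1_slope(xs, ys, l1_penalty):
    """Minimize 0.5 * sum((y - w*x)**2) + l1_penalty * abs(w) over one slope w.

    Assumes at least one x is nonzero so that the normalizer is positive.
    """
    rho = sum(x * y for x, y in zip(xs, ys))
    norm = sum(x * x for x in xs)
    shrunk = max(abs(rho) - l1_penalty, 0.0)  # soft-threshold toward zero
    return (shrunk if rho >= 0 else -shrunk) / norm


# Same training function and data, three hyper-parameter configurations.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
models = {penalty: fit_l1_slope(xs, ys, penalty) for penalty in (0.0, 14.0, 28.0)}
```

Each penalty value yields a different parameter, and therefore a different trained model, from identical training data: the slope shrinks from 2.0 with no penalty to 1.0 and then to 0.0 as the penalty grows.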

Considering the many different types of predictive models that are available, and then that each type of predictive model may have multiple training functions and that multiple hyper-parameter configurations and selected features may be used for each of the multiple training functions, there are many different trained predictive models that can be generated. Depending on the nature of the input data to be used by the trained predictive model to predict an output, different trained predictive models perform differently. That is, some can be more effective than others.

The effectiveness of each of the trained predictive models is estimated (Step 406). For example, a model selection module 210 is operable to estimate the effectiveness of each trained predictive model. In some implementations, cross-validation is used to estimate the effectiveness of each trained predictive model. In a particular example, a 10-fold cross-validation technique is used. Cross-validation is a technique where the training data is partitioned into sub-samples. A number of the sub-samples are used to train an untrained predictive model, and a number of the sub-samples (usually one) is used to test the trained predictive model. Multiple rounds of cross-validation can be performed using different sub-samples for the training sample and for the test sample. K-fold cross-validation refers to partitioning the training data into K sub-samples. One of the sub-samples is retained as the test sample, and the remaining K−1 sub-samples are used as the training sample. K rounds of cross-validation are performed, using a different one of the sub-samples as the test sample for each round. The results from the K rounds can then be averaged, or otherwise combined, to produce a cross-validation score. 10-fold cross-validation is commonly used.
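The K-fold procedure described above can be sketched as follows. The helper names are hypothetical, the interleaved partitioning is just one of many valid ways to split the data, and the majority-label "model" and exact-match accuracy are stand-ins chosen to keep the example short. The sketch assumes K is no larger than the number of records, so that no sub-sample is empty.

```python
from collections import Counter


def k_fold_partitions(records, k):
    """Split records into k interleaved sub-samples."""
    return [records[i::k] for i in range(k)]


def cross_validation_score(records, k, train, evaluate):
    """Run k rounds, holding out a different sub-sample each round, and average."""
    folds = k_fold_partitions(records, k)
    scores = []
    for i, test_fold in enumerate(folds):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = train(training)
        scores.append(evaluate(model, test_fold))
    return sum(scores) / k


def train_majority(training):
    """Stand-in training function: always predict the most common label."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda _input: majority


def accuracy(model, test_fold):
    """Fraction of test records whose predicted output matches exactly."""
    return sum(model(x) == y for x, y in test_fold) / len(test_fold)


records = [(i, "b" if i % 4 == 0 else "a") for i in range(20)]
score = cross_validation_score(records, 5, train_majority, accuracy)
```

Any training function and any scoring function with the same call shapes could be substituted, which is how one cross-validation routine can score many different candidate models on the same footing.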

In some implementations, the effectiveness of each trained predictive model is estimated by performing cross-validation to generate a cross-validation score that is indicative of the accuracy of the trained predictive model, i.e., the number of exact matches of output data predicted by the trained model when compared to the output data included in the test sub-sample. In other implementations, one or more different metrics can be used to estimate the effectiveness of the trained model. For example, cross-validation results can be used to indicate whether the trained predictive model generated more false positive results than true positives, while ignoring any false negatives.

In other implementations, techniques other than, or in addition to, cross-validation can be used to estimate the effectiveness. In one example, the resource usage costs for using the trained model can be estimated and can be used as a factor to estimate the effectiveness of the trained model.

In some implementations, the predictive modeling server system 206 operates independently from the client computing system 202 and selects and provides the trained predictive model 218 as a specialized service. The expenditure of both computing resources and human resources and expertise to select the untrained predictive models to include in the training function repository 216, the training functions to use for the various types of available predictive models, the hyper-parameter configurations to apply to the training functions and the feature-inductors all occurs server-side. Once these selections have been completed, the training and model selection can occur in an automated fashion with little or no human intervention, unless changes to the server system 206 are desired. The client computing system 202 thereby benefits from access to a trained predictive model 218 that otherwise might not have been available to the client computing system 202, due to limitations on client-side resources.

Referring again to FIG. 4, each trained model is assigned a score that represents the effectiveness of the trained model. As discussed above, the criteria used to estimate effectiveness can vary. In the example implementation described, the criterion is the accuracy of the trained model and is estimated using a cross-validation score. Based on the scores, a trained predictive model is selected (Step 408). In some implementations, the trained models are ranked based on the value of their respective scores, and the top ranking trained model is chosen as the selected predictive model. Although the selected predictive model was trained during the evaluation stage described above, training at that stage may have involved only a sample of the training data, or not all of the training data at one time. For example, if K-fold cross-validation was used to estimate the effectiveness of the trained model, then the model was not trained with all of the training data at one time, but rather only K−1 partitions of the training data. Accordingly, if necessary, the selected predictive model is fully trained using the training data (e.g., all K partitions) (Step 410), for example, by the model training module 212. A trained model (i.e., “fully trained” model) is thereby generated for use in generating predictive output, e.g., trained predictive model 218. The trained predictive model 218 can be stored by the predictive modeling server system 206. That is, the trained predictive model 218 can reside and execute in a data center that is remote from the client computing system 202.

Of the multiple trained predictive models that were trained as described above, some or all of them can be stored in the predictive model repository 215. Each trained predictive model can be associated with its respective effectiveness score. One or more of the trained predictive models in the repository 215 are updateable predictive models. In some implementations, the predictive models stored in the repository 215 are trained using the entire initial training data, i.e., all K partitions and not just K−1 partitions. In other implementations, the trained predictive models that were generated in the evaluation phase using K−1 partitions are stored in the repository 215, so as to avoid expending additional resources to recompute the trained predictive models using all K partitions.

Access to the trained predictive model is provided (Step 412) rather than the trained predictive model itself. In some implementations, providing access to the trained predictive model includes providing an address to the client computing system 202 or other user computing platform that can be used to access the trained model; for example, the address can be a URL (Uniform Resource Locator). Access to the trained predictive model can be limited to authorized users. For example, a user may be required to enter a user name and password that has been associated with an authorized user before the user can access the trained predictive model from a computing system, including the client computing system 202. If the client computing system 202 desires to access the trained predictive model 218 to receive a predictive output, the client computing system 202 can transmit to the URL a request that includes the input data. The predictive modeling server system 206 receives the input data and prediction request from the client computing system 202 (Step 414). In response, the input data is input to the trained predictive model 218 and a predictive output is generated by the trained model (Step 416). The predictive output is then provided, e.g., to the client computing system 202 (Step 418).

In some implementations, where the client computing system is provided with a URL to access the trained predictive model, input data and a request to the URL can be embedded in an HTML document, e.g., a webpage. In one example, JavaScript can be used to include the request to the URL in the HTML document. Referring again to the illustrative example above, when a customer is browsing on the client computing system's web-based online shopping service, a call to the URL can be embedded in a webpage that is provided to the customer. The input data can be the particular customer's online shopping transaction history. Code included in the webpage can retrieve the input data for the customer, which can be packaged into a request that is sent to the URL for a predictive output. In response to the request, the input data is input to the trained predictive model and a predictive output is generated. The predictive output can be provided directly to the customer's computer or can be returned to the client computing system, which can then forward the output to the customer's computer. The client computing system 202 can use and/or present the predictive output result as desired by the client entity. In this particular example, the predictive output is a prediction of the type of product the customer is most likely to be interested in purchasing. If the predictive output is “blender”, then, by way of example, an HTML document executing on the customer's computer may include code that, in response to receiving the predictive output, causes one or more images and/or descriptions of blenders available for sale on the client computing system's online shopping service to be displayed on the customer's computer. This integration is simple for the client computing system, because the interaction with the predictive modeling server system can use standard HTTP, e.g., a GET or POST request can be made to a URL that returns a JSON (JavaScript Object Notation) encoded output. The input data also can be provided in JSON format.
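As a minimal sketch of the JSON-over-HTTP interaction described above, the following builds a POST request carrying JSON input data and decodes a JSON response. The URL, the "input" and "output" field names and the shape of the transaction history are hypothetical; the specification does not define a wire format. The request is constructed but not sent.

```python
import json
import urllib.request

# Hypothetical address of the kind that could be provided in Step 412.
MODEL_URL = "https://predict.example.com/models/shopping"


def build_prediction_request(url: str, input_data: dict) -> urllib.request.Request:
    """Package the input data as a JSON body of an HTTP POST request."""
    body = json.dumps({"input": input_data}).encode("utf-8")  # "input" is an assumed key
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction_response(raw: bytes) -> str:
    """Extract the predictive output from a JSON-encoded response body."""
    return json.loads(raw.decode("utf-8"))["output"]  # "output" is an assumed key


# Hypothetical transaction history for one customer.
request = build_prediction_request(MODEL_URL, {"history": ["toaster", "kettle"]})
# Sending would be: urllib.request.urlopen(request).read()
prediction = parse_prediction_response(b'{"output": "blender"}')
```

Separating request construction from response parsing keeps the sketch runnable without network access; a real client would also need whatever authorization the service requires.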

The customer using the customer computer can be unaware of these operations, which occur in the background without necessarily requiring any interaction from the customer. Advantageously, the request to the trained predictive model can seamlessly be incorporated into the client computer system's web-based application, in this example an online shopping service. A predictive output can be generated for and received at the client computing system (which in this example includes the customer's computer), without expending client computing system resources to generate the output.

In other implementations, the client computing system can use code (provided by the client computing system or otherwise) that is configured to make a request to the predictive modeling server system 206 to generate a predictive output using the trained predictive model 218. By way of example, the code can be a command line program (e.g., using cURL) or a program written in a compiled language (e.g., C, C++, Java) or an interpreted language (e.g., Python). In some implementations, the trained model can be made accessible to the client computing system or other computer platforms by an API through a hosted development and execution platform, e.g., Google App Engine.

In the implementations described above, the trained predictive model 218 is hosted by the predictive modeling server system 206 and can reside and execute on a computer at a location remote from the client computing system 202. However, in some implementations, once a predictive model has been selected and trained, the client entity may desire to download the trained predictive model to the client computing system 202 or elsewhere. The client entity may wish to generate and deliver predictive outputs on the client's own computing system or elsewhere. Accordingly, in some implementations, the trained predictive model 218 is provided to a client computing system 202 or elsewhere, and can be used locally by the client entity.

Components of the client computing system 202 and/or the predictive modeling system 206, e.g., the model training module 212, model selection module 210 and trained predictive model 218, can be realized by instructions that upon execution cause one or more computers to carry out the operations described above. Such instructions can comprise, for example, interpreted instructions, such as script instructions, e.g., JavaScript or ECMAScript instructions, or executable code, or other instructions stored in a computer readable medium. The components of the client computing system 202 and/or the predictive modeling system 206 can be implemented in multiple computers distributed over a network, such as a server farm, in one or more locations, or can be implemented in a single computer device.

As discussed above, the predictive modeling server system 206 can be implemented “in the cloud”. In some implementations, the predictive modeling server system 206 provides a web-based service. A web page at a URL provided by the predictive modeling server system 206 can be accessed by the client computing system 202. An operator of the client computing system 202 can follow instructions displayed on the web page to upload training data “to the cloud”, i.e., to the predictive modeling server system 206. Once completed, the operator can enter an input to initiate the training and selecting operations to be performed “in the cloud”, i.e., by the predictive modeling server system 206, or these operations can be automatically initiated in response to the training data having been uploaded.

The operator of the client computing system 202 can access the one or more trained models that are available to the client computing system 202 from the web page. For example, if more than one set of training data (e.g., relating to different types of input that correspond to different types of predictive output) had been uploaded by the client computing system 202, then more than one trained predictive model may be available to the particular client computing system. Representations of the available predictive models can be displayed, for example, by names listed in a drop down menu or by icons displayed on the web page, although other representations can be used. The operator can select one of the available predictive models, e.g., by clicking on the name or icon. In response, a second web page (e.g., a form) can be displayed that prompts the operator to upload input data that can be used by the selected trained model to provide predictive output data (in some implementations, the form can be part of the first web page described above). For example, an input field can be provided, and the operator can enter the input data into the field. The operator may also be able to select and upload a file (or files) from the client computing system 202 to the predictive modeling server system 206 using the form, where the file or files contain the input data. In response, the selected predictive model can generate predictive output based on the input data provided, and provide the predictive output to the client computing system 202 either on the same web page or a different web page. The predictive output can be provided by displaying the output, providing an output file or otherwise.

In some implementations, the client computing system 202 can grant permission to one or more other client computing systems to access one or more of the available trained predictive models of the client computing system. The web page used by the operator of the client computing system 202 to access the one or more available trained predictive models can be used (either directly or indirectly as a link to another web page) by the operator to enter information identifying the one or more other client computing systems being granted access and possibly specifying limits on their accessibility. Conversely, if the client computing system 202 has been granted access by a third party (i.e., an entity controlling a different client computing system) to access one or more of the third party's trained models, the operator of the client computing system 202 can access the third party's trained models using the web page in the same manner as accessing the client computing system's own trained models (e.g., by selecting from a drop down menu or clicking an icon).

FIG. 5 is a flowchart showing an example process 500 for using the predictive analytic platform from the perspective of the client computing system. For illustrative purposes, the process 500 is described in reference to the predictive modeling server system 206 of FIG. 2, although it should be understood that a differently configured system could perform the process 500. The process 500 would be carried out by the client computing system 202 when the corresponding client entity was uploading the “new” training data to the system 206. That is, after the initial training data had been uploaded by the client computing system and used to train multiple predictive models, at least one of which was then made accessible to the client computing system, additional new training data becomes available. The client computing system 202 uploads the new training data to the predictive modeling server system 206 over the network 204 (Box 502).

In some implementations, the client computing system 202 uploads new training data sets serially. For example, the client computing system 202 may upload a new training data set whenever one becomes available, e.g., on an ad hoc basis. In another example, the client computing system 202 may upload a new training data set according to a particular schedule, e.g., at the end of each day. In some implementations, the client computing system 202 uploads a series of new training data sets batched together into one relatively large batch. For example, the client computing system 202 may upload a new batch of training data sets whenever the batched series of training data sets reaches a certain size (e.g., number of megabytes). In another example, the client computing system 202 may upload a new batch of training data sets according to a particular schedule, e.g., once a month.

Table 1 below shows some illustrative examples of commands that can be used by the client computing system 202 to upload a new training data set that includes an individual update, a group update (e.g., multiple examples within an API call), an update from a file and an update from an original file (i.e., a file previously used to upload training data).

TABLE 1

Type of Update | Command
Individual Update | curl -X POST -H . . . -d "{\"data\":{\"input\":{\"mixture\":[0,2]} \"output\":[0]}}}" https . . . /bucket%2Ffile.csv/update
Individual Update | curl -X POST -H . . . -d "{\"data\":{\"data\":[0,0,2]}}" https . . . /bucket%2Ffile.csv/update
Group Update | curl -X POST -H . . . -d "{\"data\":{\"input\":{\"mixture\":[[0,2],[1,2] . . . [x,y]]}\"output\":[0, 1 . . . z]}}}" https . . . /bucket%2Ffile.csv/update
Group Update | curl -X POST -H . . . -d "{\"data\":{\"data\":[[0,0,.2],[1,1,2] . . . [z,x,y]]}}" https . . . /bucket%2Ffile.csv/update
Update from File | curl -X POST -H . . . -d "bucket%2Fnewfile" https . . . /bucket%2Ffile.csv/update
Update from Original File | curl -X POST -H . . . https . . . /bucket%2Ffile.csv/update

In the above example commands, “data” refers to data used in training the models (i.e., training data); “mixture” refers to a combination of text and numeric data; “input” refers to data to be used to update the model (i.e., new training data); “bucket” refers to a location where the models to be updated are stored; and “x”, “y” and “z” refer to other potential data values for a given feature.

The series of training data sets uploaded by the client computing system 202 can be stored in the training data queue 213 shown in FIG. 2. In some implementations, the training data queue 213 accumulates new training data until an update of the updateable trained predictive models included in the predictive model repository 215 is performed. In other implementations, the training data queue 213 only retains a fixed amount of data or is otherwise limited. In such implementations, once the training data queue 213 is full, an update can be performed automatically, a request can be sent to the client computing system 202 requesting instructions to perform an update, or training data in the queue 213 can be deleted to make room for more new training data. Other events can trigger a retraining, as is discussed further below.
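The three responses to a full queue described above can be sketched as a small bounded queue. The class name, the policy labels ("update", "ask_client", "drop_oldest") and the choice to measure capacity in number of data sets rather than bytes are all assumptions made for illustration; the queue only signals which action applies and leaves the retraining itself to the caller.

```python
from collections import deque
from typing import Any, Deque, List


class TrainingDataQueue:
    """Hypothetical bounded queue of training data sets."""

    POLICIES = ("update", "ask_client", "drop_oldest")

    def __init__(self, capacity: int, on_full: str = "update") -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if on_full not in self.POLICIES:
            raise ValueError("unknown policy: " + on_full)
        self.capacity = capacity
        self.on_full = on_full
        self.items: Deque[Any] = deque()

    def add(self, data_set: Any) -> str:
        """Queue a data set and report which action the caller should take."""
        if self.on_full == "drop_oldest":
            # Delete queued training data to make room for the new data set.
            while len(self.items) >= self.capacity:
                self.items.popleft()
            self.items.append(data_set)
            return "queued"
        self.items.append(data_set)
        if len(self.items) >= self.capacity:
            # Either trigger an automatic update or ask the client for instructions.
            return self.on_full
        return "queued"

    def drain(self) -> List[Any]:
        """Hand all queued data to an update and empty the queue."""
        drained = list(self.items)
        self.items.clear()
        return drained
```

With the "update" and "ask_client" policies the sketch does not reject further data, so the queue can exceed its nominal capacity if the caller ignores the signal; a stricter design could refuse additions instead.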

The client computing system 202 can request that its trained predictive models be updated (Box 504). For example, when the client computing system 202 uploads the series of training data sets (either incrementally or in batch or a combination of both), an update request can be included or implied, or the update request can be made independently of uploading new training data.

In some implementations, an update automatically occurs upon a condition being satisfied. For example, receiving new training data in and of itself can satisfy the condition and trigger the update. In another example, receiving an update request from the client computing system 202 can satisfy the condition. Other examples are described further in reference to FIG. 6.

As described above in reference to FIGS. 2 and 4, the predictive model repository 215 includes multiple trained predictive models that were trained using training data uploaded by the client computing system 202. At least some of the trained predictive models included in the repository 215 are updateable predictive models. When an update of the updateable predictive models occurs, retrained predictive models are generated using the data in the training data queue 213, the updateable predictive models and the corresponding training functions that were used to train the updateable predictive models. Each retrained predictive model represents an update to the predictive model that was used to generate the retrained predictive model.

Each retrained predictive model that is generated using the new training data from the training data queue 213 can be scored to estimate the effectiveness of the model. That is, an effectiveness score can be generated, for example, in the manner described above. In some implementations, the effectiveness score of a retrained predictive model is determined by tallying the results from the initial cross-validation (i.e., done for the updateable predictive model from which the retrained predictive model was generated) and adding in the retrained predictive model's score on each new piece of training data. By way of illustrative example, consider Model A that was trained with a batch of 100 training samples and has an estimated 67% accuracy as determined from cross-validation. Model A then is updated (i.e., retrained) with 10 new training samples, and the retrained Model A gets 5 predictive outputs correct and 5 predictive outputs incorrect. The retrained Model A's accuracy can be calculated as (67+5)/(100+10)≈65%.
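The tally in the Model A example can be written out as a short function. The function and parameter names are illustrative only; the arithmetic follows the example above, converting the initial accuracy back into a count of correct outputs before adding the results on the new training samples.

```python
def combined_accuracy(initial_accuracy: float, initial_count: int,
                      new_correct: int, new_count: int) -> float:
    """Combine an initial cross-validation accuracy with results on new samples."""
    if initial_count + new_count == 0:
        raise ValueError("at least one sample is required")
    # Recover the tally of correct outputs from the initial cross-validation.
    initial_correct = initial_accuracy * initial_count
    return (initial_correct + new_correct) / (initial_count + new_count)


# Model A: 67% accuracy on 100 samples, then 5 of 10 new samples correct.
model_a_accuracy = combined_accuracy(0.67, 100, 5, 10)  # (67 + 5) / (100 + 10)
```

Because the new samples are weighted by their count, a small update moves the score only slightly, which matches the drop from 67% to roughly 65% in the example.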

In some implementations, the effectiveness score of the retrained predictive model is compared to the effectiveness score of the trained predictive model from which the retrained predictive model was derived. If the retrained predictive model is more effective, then the retrained predictive model can replace the initially trained predictive model in the predictive model repository 215. If the retrained predictive model is less effective, then it can be discarded. In other implementations, both predictive models are stored in the repository, which therefore grows in size. In other implementations, the number of predictive models stored in the repository 215 is fixed, e.g., to n models where n is an integer, and only the trained predictive models with the top n effectiveness scores are stored in the repository. Other techniques can be used to decide which trained predictive models to store in the repository 215.
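Two of the storage policies described above, replacing a model only when its retrained version scores higher and keeping only the top n models, can be sketched as follows. Representing the repository as a mapping from a model name to its effectiveness score is an assumption made for brevity; a real repository would also hold the models themselves.

```python
from typing import Dict


def replace_if_better(repository: Dict[str, float],
                      retrained: Dict[str, float]) -> Dict[str, float]:
    """Keep, for each model, whichever of the stored and retrained versions scores higher."""
    updated = dict(repository)
    for name, score in retrained.items():
        if score > updated.get(name, float("-inf")):
            updated[name] = score  # the retrained model replaces the stored one
        # otherwise the retrained model is discarded
    return updated


def keep_top_n(repository: Dict[str, float], n: int) -> Dict[str, float]:
    """Retain only the n entries with the highest effectiveness scores."""
    ranked = sorted(repository.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


# Hypothetical scores for three stored models and two retrained models.
stored = {"A": 0.67, "B": 0.80, "C": 0.55}
after_update = replace_if_better(stored, {"A": 0.70, "B": 0.75})
top_two = keep_top_n(after_update, 2)
```

In this sketch a retrained model that merely ties its predecessor is discarded; whether ties should favor the newer model is a design choice the description leaves open.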

If the predictive model repository 215 included one or more static predictive models, that is, trained predictive models that are not updateable with incremental new training data, then those models are not updated during this update phase (i.e., update phase where an update of only the updateable predictive models is occurring). From the trained predictive models available to the client computing system 202, including the “new” retrained predictive models and the “old” static trained predictive models, a trained predictive model can be selected to provide to the client computing system 202. For example, the effectiveness scores of the available trained predictive models can be compared, and the most effective trained predictive model selected. The client computing system 202 can receive access to the selected trained predictive model (Box 506).

In some instances, the selected trained predictive model is the same trained predictive model that was selected and provided to the client computing system 202 after the trained predictive models in the repository 215 were trained with the initial training data or a previous batch of training data from the training data queue. That is, the most effective trained predictive model from those available may remain the same even after an update. In other instances, a different trained predictive model is selected as being the most effective. Changing the trained predictive model that is accessible by the client computing system 202 can be invisible to the client computing system 202. That is, from the perspective of the client computing system 202, input data and a prediction request are provided to the accessible trained predictive model (Box 508). In response, a predictive output is received by the client computing system 202 (Box 510). The selected trained predictive model is used to generate the predictive output based on the received input. However, if the particular trained predictive model being used system-side changes, this can make no difference from the perspective of the client computing system 202, other than that a more effective model is being used and therefore the predictive output should be correspondingly more accurate as a prediction.

From the perspective of the client computing system 202, updating the updateable trained predictive models is relatively simple. The updating can be all done remote from the client computing system 202 without expending client computing system resources. In addition to updating the updateable predictive models, the static predictive models can be “updated”. The static predictive models are not actually “updated”, but rather new static predictive models can be generated using training data that includes new training data. Updating the static predictive models is described in further detail below in reference to FIG. 7.

FIG. 6 is a flowchart showing an example process 600 for retraining updateable trained predictive models using the predictive analytic platform. For illustrative purposes, the process 600 is described in reference to the predictive modeling server system 206 of FIG. 2, although it should be understood that a differently configured system could perform the process 600. The process 600 begins with providing access to an initial trained predictive model (e.g., trained predictive model 218) that was trained with initial training data (Box 602). That is, for example, operations such as those described above in reference to boxes 402-412 of FIG. 4 can have already occurred such that a trained predictive model has been selected (e.g., based on effectiveness) and access to the trained predictive model has been provided, e.g., to the client computing system 202.

A series of training data sets are received from the client computing system 202 (Box 604). For example, as described above, the series of training data sets can be received incrementally or can be received together as a batch. The series of training data sets can be stored in the training data queue 213. When a first condition is satisfied (“yes” branch of box 606), then an update of updateable trained predictive models stored in the predictive model repository 215 occurs. Until the first condition is satisfied (“no” branch of box 606), access can continue to be provided to the initial trained predictive model (i.e., box 602) and new training data can continue to be received and added to the training data queue 213 (i.e., box 604).

The first condition that can trigger an update of updateable trained predictive models can be selected to accommodate various considerations. Some example first conditions were already described above in reference to FIG. 5. That is, receiving new training data in and of itself can satisfy the first condition and trigger the update. Receiving an update request from the client computing system 202 can satisfy the first condition. Other examples of a first condition include a threshold size of the training data queue 213. That is, once the volume of data in the training data queue 213 reaches a threshold size, the first condition can be satisfied and an update can occur. The threshold size can be defined as a predetermined value, e.g., a certain number of kilobytes of data, or can be defined as a fraction of the training data included in the training data repository 214. That is, once the amount of data in the training data queue is equal to or exceeds x % of the data used to initially train the trained predictive model 218 or x % of the data in the training data repository 214 (which may be the same, but could be different), the threshold size is reached. In another example, once a predetermined time period has expired, the first condition is satisfied. For example, an update can be scheduled to occur once a day, once a week or otherwise. In another example, if the training data is categorized, then when the training data in a particular category included in the new training data reaches a fraction of the initial training data in the particular category, then the first condition can be satisfied. In another example, if the training data can be identified by feature, then when the training data with a particular feature reaches a fraction of the initial training data having the particular feature, the first condition can be satisfied (e.g., widgets X with scarce property Y). In yet another example, if the training data can be identified by regression region, then when the training data within a particular regression region reaches a fraction of the initial training data in the particular regression region (e.g., 10% more in the 0.0 to 0.1 predicted range), then the first condition can be satisfied. The above are illustrative examples, and other first conditions can be used to trigger an update of the updateable trained predictive models stored in the predictive model repository 215.
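Several of the example first conditions above (an explicit update request, an absolute queue size, a fraction of the repository size and an elapsed time period) can be checked with one predicate. The function name, its parameters and the use of bytes and seconds as units are assumptions made for this sketch; the per-category, per-feature and per-regression-region conditions would need additional bookkeeping that is omitted here.

```python
from typing import Optional


def first_condition_met(queue_bytes: int,
                        repository_bytes: int,
                        seconds_since_update: float,
                        update_requested: bool = False,
                        *,
                        max_queue_bytes: Optional[int] = None,
                        max_fraction: Optional[float] = None,
                        max_age_seconds: Optional[float] = None) -> bool:
    """Return True if any configured trigger for an update is satisfied."""
    if update_requested:
        return True  # the client asked for an update
    if max_queue_bytes is not None and queue_bytes >= max_queue_bytes:
        return True  # queue reached a predetermined size
    if (max_fraction is not None and repository_bytes > 0
            and queue_bytes / repository_bytes >= max_fraction):
        return True  # queue reached x % of the stored training data
    if max_age_seconds is not None and seconds_since_update >= max_age_seconds:
        return True  # scheduled period has expired
    return False
```

Leaving a threshold as `None` disables that trigger, so a deployment that only wants scheduled updates could pass `max_age_seconds` alone.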

The updateable trained predictive models that are stored in the repository 215 are “updated” with the training data stored in the training data queue 213. That is, retrained predictive models are generated (Box 608) using: the training data queue 213; the updateable trained predictive models obtained from the repository 215; and the corresponding training functions that were initially used to train the updateable trained predictive models, which training functions are obtained from the training function repository 216.

The effectiveness of each of the generated retrained predictive models is estimated (Box 610). The effectiveness can be estimated, for example, in the manner described above in reference to FIG. 5 and an effectiveness score for each retrained predictive model can be generated.

A trained predictive model is selected from the multiple trained predictive models based on their respective effectiveness scores. That is, the effectiveness scores of the retrained predictive models and the effectiveness scores of the trained predictive models already stored in the repository 215 can be compared and the most effective model, i.e., a first trained predictive model, selected. Access is provided to the first trained predictive model to the client computing system 202 (Box 612). As was discussed above, in some implementations, the effectiveness of each retrained predictive model can be compared to the effectiveness of the updateable trained predictive model from which it was derived, and the most effective of the two models stored in the repository 215 and the other discarded. In some implementations, this step can occur first and then the effectiveness scores of all of the models stored in the repository 215 can be compared and the first trained predictive model selected. As was also discussed above, the first trained predictive model may end up being the same model as the initial trained predictive model that was provided to the client computing system 202 in Box 602. That is, even after the update, the initial trained predictive model may still be the most effective model. In other instances, a different trained predictive model may end up being the most effective, and therefore the trained predictive model to which the client computing system 202 has access changes after the update.

Of the multiple retrained predictive models that were trained as described above, some or all of them can be stored in the predictive model repository 215. In some implementations, the predictive models stored in the repository 215 are trained using the entire new training data, i.e., all K partitions and not just K−1 partitions. In other implementations, the trained predictive models that were generated in an evaluation phase using K−1 partitions are stored in the repository 215, so as to avoid expending additional resources to recompute the trained predictive models using all K partitions.

In the implementations described above, the first trained predictive model is hosted by the dynamic predictive modeling server system 206 and can reside and execute on a computer at a location remote from the client computing system 202. However, as described above in reference to FIG. 4, in some implementations, once a predictive model has been selected and trained, the client entity may desire to download the trained predictive model to the client computing system 202 or elsewhere. The client entity may wish to generate and deliver predictive outputs on the client's own computing system or elsewhere. Accordingly, in some implementations, the first trained predictive model 218 is provided to a client computing system 202 or elsewhere, and can be used locally by the client entity.

FIG. 7 is a flowchart showing an example process 700 for generating a new set of trained predictive models using updated training data. For illustrative purposes, the process 700 is described in reference to the predictive modeling server system 206 of FIG. 2, although it should be understood that a differently configured system could perform the process 700. The process 700 begins with providing access to a first trained predictive model (e.g., trained predictive model 218) (Box 702). That is, for example, operations such as those described above in reference to boxes 602-612 of FIG. 6 can have already occurred such that the first trained predictive model has been selected (e.g., based on effectiveness) and access to the first trained predictive model has been provided, e.g., to the client computing system 202. In another example, the first trained predictive model can be a trained predictive model that was trained using the initial training data. That is, for example, operations such as those described above in reference to boxes 402-412 of FIG. 4 can have already occurred such that a trained predictive model has been selected (i.e., the first trained predictive model) and access to the first trained predictive model has been provided. Typically, the process 700 occurs after some updating of the updateable trained predictive models has already occurred (i.e., after process 600), although that is not necessarily the case.

Referring again to FIG. 7, when a second condition is satisfied (“yes” branch of box 704), then an “update” of some or all the trained predictive models stored in the predictive model repository 215 occurs, including the static trained predictive models. This phase of updating is more accurately described as a phase of “regeneration” rather than updating. That is, the trained predictive models from the repository 215 are not actually updated, but rather a new set of trained predictive models is generated using different training data than was used to initially train the models in the repository (i.e., different from the initial training data in this example).

Updated training data is generated (Box 706) that will be used to generate the new set of trained predictive models. In some implementations, the training data stored in the training data queue 213 is added to the training data that is stored in the training data repository 214. The merged set of training data can be the updated training data. Such a technique can work well if there are no constraints on the amount of data that can be stored in the training data repository 214. However, in some instances there are such constraints, and a data retention policy can be implemented to determine which training data to retain and which to delete for purposes of storing training data in the repository 214 and generating the updated training data. The data retention policy can define rules governing maintaining and deleting data. For example, the policy can specify a maximum volume of training data to maintain in the training data repository, such that if adding training data from the training data queue 213 will cause the maximum volume to be exceeded, then some of the training data is deleted. The particular training data that is to be deleted can be selected based on the date of receipt (e.g., the oldest data is deleted first), selected randomly, selected sequentially if the training data is ordered in some fashion, based on a property of the training data itself, or otherwise selected.
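
By way of illustration only, the maximum-volume rule with oldest-first deletion described above could be sketched as follows. The names TrainingRecord, merge_with_retention, and max_bytes are hypothetical and do not appear in this specification, and measuring volume in bytes per record is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class TrainingRecord:
    received_at: float  # receipt time, e.g. seconds since the epoch (assumed)
    size_bytes: int     # volume this record contributes (assumed unit)
    payload: Any        # the training example itself


def merge_with_retention(repository: List[TrainingRecord],
                         queue: List[TrainingRecord],
                         max_bytes: int) -> List[TrainingRecord]:
    """Merge the queue into the repository; delete oldest records past max_bytes."""
    merged = sorted(repository + queue, key=lambda r: r.received_at)
    total = sum(r.size_bytes for r in merged)
    first_kept = 0
    # Drop records from the oldest end until the volume constraint holds.
    while first_kept < len(merged) and total > max_bytes:
        total -= merged[first_kept].size_bytes
        first_kept += 1
    return merged[first_kept:]
```

Other selection rules mentioned above (random, sequential, or property-based deletion) would replace only the sort key or the deletion loop in such a sketch.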

A particular illustrative example of selecting the training data to delete based on a property of the training data can be described in terms of a trained predictive model that is a classifier and training data that is multiple feature vectors. An analysis can be performed to determine ease of classification of each feature vector in the training data using the classifier. A set of feature vectors can be deleted that includes a larger proportion of “easily” classified feature vectors. That is, based on an estimation of how hard the classification is, the feature vectors included in the stored training data can be pruned to satisfy either a threshold volume of data or another constraint used to control what is retained in the training data repository 214.
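
One way the pruning described above might look is sketched below. The specification does not say how ease of classification is estimated; this sketch assumes it is approximated by the confidence the classifier assigns to the true label, supplied as a hypothetical callable confidence_in_true_label. It also simplifies by deleting strictly the easiest feature vectors, whereas the description only requires that the deleted set contain a larger proportion of them.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, true label)


def prune_easy_examples(examples: List[Example],
                        confidence_in_true_label: Callable[[Example], float],
                        max_examples: int) -> List[Example]:
    """Keep at most max_examples, preferring the hardest-to-classify ones."""
    if len(examples) <= max_examples:
        return list(examples)
    # Low confidence in the true label is treated as "hard"; sort hardest first.
    hardest_first = sorted(examples, key=confidence_in_true_label)
    return hardest_first[:max_examples]
```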

For illustrative purposes, in one example the updated training data can be generated by combining the training data in the training data queue together with the training data already stored in the training data repository 214 (e.g., the initial training data). In some implementations, the updated training data can then be stored in the training data repository 214 and can replace the training data that was previously stored (to the extent that the updated training data is different). In some implementations, the training data queue 213 can be cleared to make space for new training data to be received in the future.

A new set of trained predictive models is generated using the updated training data and using training functions that are obtained from the training function repository 216 (Box 708). The new set of trained predictive models includes at least some updateable trained predictive models and can include at least some static trained predictive models.

The effectiveness of each trained predictive model in the new set can be estimated, for example, using techniques described above (Box 710). In some implementations, an effectiveness score is generated for each of the new trained predictive models.

A second trained predictive model can be selected to which access is provided to the client computing system 202 (Box 712). In some implementations, the effectiveness scores of the new trained predictive models and the trained predictive models stored in the repository 215 before this updating phase began are all compared and the most effective trained predictive model is selected as the second trained predictive model. In some implementations, the trained predictive models that were stored in the repository 215 before this updating phase began are discarded and replaced with the new set of trained predictive models, and the second trained predictive model is selected from the trained predictive models currently stored in the repository 215. In some implementations, the static trained predictive models that were stored in the repository 215 before the updating phase began are replaced by their counterpart new static trained predictive models. The updateable trained predictive models that were stored in the repository 215 before the updating phase are either replaced by their counterpart new trained predictive model or maintained, depending on which of the two is more effective. The second trained predictive model then can be selected from among the trained predictive models stored in the repository 215.

In some implementations, only a predetermined number of predictive models are stored in the repository 215, e.g., n (where n is an integer greater than 1), and the trained predictive models with the top n effectiveness scores are selected from among the total available predictive models, i.e., from among the new set of trained predictive models and the trained predictive models that were stored in the repository 215 before the updating phase began. Other techniques can be used to determine which trained predictive models to store in the repository 215 and which pool of trained predictive models is used to select the second trained predictive model.
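
As a minimal sketch of the top-n selection described above, the following assumes each trained predictive model is represented by a hypothetical (identifier, effectiveness score) pair and that a higher score means a more effective model. The function name keep_top_n is illustrative only.

```python
import heapq
from typing import List, Tuple

ScoredModel = Tuple[str, float]  # (model identifier, effectiveness score)


def keep_top_n(existing: List[ScoredModel],
               new: List[ScoredModel],
               n: int) -> Tuple[List[ScoredModel], ScoredModel]:
    """Return the n highest-scoring models and the single most effective one."""
    candidates = existing + new
    if not candidates:
        raise ValueError("no trained predictive models to select from")
    # nlargest returns the candidates in descending score order.
    kept = heapq.nlargest(n, candidates, key=lambda m: m[1])
    return kept, kept[0]
```

In this sketch the first element of the retained list serves as the second trained predictive model to which access is provided.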

Referring again to Box 704, until the second condition is satisfied, which triggers the update of all models included in the repository 215 with updated training data (“No” branch of box 704), the client computing system 202 can continue to be provided access to the first trained predictive model.

FIG. 8 is a flowchart showing an example process 800 for maintaining an updated dynamic repository of trained predictive models. The repository of trained predictive models is dynamic in that new training data can be received and used to update the trained predictive models included in the repository by retraining the updateable trained predictive models and regenerating the static and updateable trained predictive models with updated training data. The dynamic repository can be maintained at a location remote from a computing system that will use one or more of the trained predictive models to generate predictive output. By way of illustrative and non-limiting example, the dynamic repository can be maintained by the predictive modeling server system 206 shown in FIG. 2 for the client computing system 202. In other implementations, the computing system can maintain the dynamic repository locally. For the purpose of describing the process 800, reference shall be made to the system shown in FIG. 2, although it should be understood that a differently configured system can be used to perform the process (e.g., if the computing system is maintaining the dynamic repository locally).

When this process 800 begins, a set of trained predictive models exists that includes one or more updateable trained predictive models and one or more static trained predictive models that were previously generated from a set of training data stored in the training data repository 214 and a set of training functions stored in the training function repository 216. The set of trained predictive models is stored in the predictive model repository 215. A series of new training data sets are received (Box 702). The sets of training data can be received incrementally (i.e., serially) or together in one or more batches. The training data sets are added to the training data queue 213. New training data can continue to accumulate in the training data queue 213 as new training data sets are received. The training data sets are “new” in that they are new as compared to the training data in the training data repository 214 that was used to train the set of trained predictive models in the predictive model repository 215.

When a first condition is satisfied (“yes” branch of box 806), then an update of updateable trained predictive models stored in the predictive model repository 215 occurs. The first condition that can trigger an update of updateable trained predictive models can be selected to accommodate various considerations. Some example first conditions were already described above in reference to FIG. 6, although other conditions can be used as the first condition. Until the first condition is satisfied (“no” branch of box 806), training data sets can continue to be received and added to the training data queue 213.

When the first condition is satisfied, an update of the updateable trained predictive models stored in the repository 215 is triggered. The updateable trained predictive models that are stored in the repository 215 are “updated” with the training data stored in the training data queue 213. That is, retrained predictive models are generated (Box 808) using: the training data queue 213; the updateable trained predictive models obtained from the repository 215; and the corresponding training functions that were previously used to train the updateable trained predictive models, which training functions are obtained from the training function repository 216.
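
The three inputs listed above (queue, existing updateable models, and their training functions) can be illustrated with a deliberately simple toy model. The sketch below is not a model type named in this specification: it assumes a hypothetical MeanModel that predicts the mean of the targets seen so far, chosen only because it can absorb new data without revisiting old data. All names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MeanModel:
    count: int    # number of training targets absorbed so far
    total: float  # running sum of those targets

    def predict(self) -> float:
        return self.total / self.count if self.count else 0.0


def train_mean(targets: List[float]) -> MeanModel:
    """Training function: build a model from scratch."""
    return MeanModel(len(targets), sum(targets))


def update_mean(model: MeanModel, targets: List[float]) -> MeanModel:
    """Same training function applied incrementally to an existing model."""
    return MeanModel(model.count + len(targets), model.total + sum(targets))


def retrain_updateable(models: Dict[str, MeanModel],
                       update_functions: Dict[str, Callable],
                       queue: List[float]) -> Dict[str, MeanModel]:
    """Generate one retrained model per updateable model using the queue only."""
    return {name: update_functions[name](model, queue)
            for name, model in models.items()}
```

The retrained models are returned as new objects, so the originals remain available for an effectiveness comparison before the repository is updated.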

The predictive model repository 215 is updated (Box 810). In some implementations, the predictive model repository 215 is updated by adding the retrained predictive models to the trained predictive models already stored in the repository 215, thereby increasing the total number of trained predictive models in the repository 215. In other implementations, each of the trained predictive models in the repository 215 is associated with an effectiveness score and the effectiveness scores of the retrained predictive models are generated. The effectiveness score of each retrained predictive model can be compared to the effectiveness score of the updateable trained predictive model from which it was derived, and the more effective of the two models stored in the repository 215 and the other discarded, thereby maintaining the same total number of trained predictive models in the repository 215. In other implementations, where there is a desire to maintain only n trained predictive models in the repository (where n is an integer greater than 1), the effectiveness scores of the retrained predictive models and the trained predictive models already stored in the repository 215 can be compared and the n most effective trained predictive models stored in the repository 215 and the others discarded. Other techniques can be used to determine which trained predictive models to store in the repository 215 after the updateable trained predictive models have been retrained.
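
The pairwise comparison described above, which keeps the repository size constant, might be sketched as follows. The dictionary layout, the function name keep_more_effective, and the tie-breaking rule (the original model is retained on equal scores) are assumptions made for illustration.

```python
from typing import Any, Dict, Tuple

Scored = Tuple[Any, float]  # (model object or identifier, effectiveness score)


def keep_more_effective(originals: Dict[str, Scored],
                        retrained: Dict[str, Scored]) -> Dict[str, Scored]:
    """For each retrained model, keep it only if it beats the model it came from."""
    repository = dict(originals)
    for name, candidate in retrained.items():
        current = originals.get(name)
        # Assumption: on a tie the original model is retained.
        if current is None or candidate[1] > current[1]:
            repository[name] = candidate
    return repository
```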

The training data repository 214 is updated (Box 812). In some implementations, the training data stored in the training data queue 213 is added to the training data that is stored in the training data repository 214. The merged set of training data can be the updated training data. In other implementations, a data retention policy can be implemented to determine which training data to retain and which to delete for purposes of updating the training data repository 214. As was described above in reference to FIG. 7, a data retention policy can define rules governing maintaining and deleting data. For example, the policy can specify a maximum volume of training data to maintain in the training data repository, such that if adding training data from the training data queue 213 will cause the maximum volume to be exceeded, then some of the training data is deleted. The particular training data that is to be deleted can be selected based on the date of receipt (e.g., the oldest data is deleted first), selected randomly, selected sequentially if the training data is ordered in some fashion, based on a property of the training data itself, or otherwise selected. Other techniques can be used to determine which training data from the received series of training data sets is stored in the training data repository 214 and which training data already in the repository 214 is retained.

When a second condition is satisfied (“yes” branch of box 814), then an “update” of all the trained predictive models stored in the predictive model repository 215 occurs, including both the static trained predictive models and the updateable trained predictive models. This phase of updating is more accurately described as a phase of “regeneration” rather than updating. That is, the trained predictive models from the repository 215 are not actually updated, but rather a new set of trained predictive models is generated using different training data than was previously used to train the models in the repository 215. The new set of trained predictive models is generated using the updated training data repository 214 and multiple training functions obtained from the training function repository 216 (Box 816). The updated training data repository 214 can include some (or all) of the same training data that was previously used to train the existing set of models in the repository in addition to some (or all) of the received series of training data sets that were received since the last occurrence of the second condition being satisfied.

The predictive model repository is updated (Box 818). In some implementations, the trained predictive models that were stored in the repository 215 before the second condition was satisfied (i.e., before this updating phase began) are discarded and replaced with the new set of trained predictive models. In some implementations, the static trained predictive models that were stored in the repository 215 before the updating phase began are replaced by their counterpart new static trained predictive models. However, the updateable trained predictive models that were stored in the repository 215 before the updating phase are either replaced by their counterpart new trained predictive model or maintained, depending on which of the two is more effective (e.g., based on a comparison of effectiveness scores). In some implementations, only a predetermined number of predictive models are stored in the repository 215, e.g., n (where n is an integer greater than 1), and the trained predictive models with the top n effectiveness scores are selected from among the total available predictive models, i.e., from among the new set of trained predictive models and the trained predictive models that were stored in the repository 215 before the updating phase began. In some implementations, only trained predictive models with an effectiveness score exceeding a predetermined threshold score are stored in the repository 215 and all others are discarded. Other techniques can be used to determine which trained predictive models to store in the repository 215.
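
The counterpart-replacement rule and the threshold rule described above are sketched together below. Combining them in one function is an assumption made for brevity; the specification presents them as separate implementations. The StoredModel type, the matching of counterparts by name, and the dropping of old models that have no new counterpart are likewise hypothetical choices.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StoredModel:
    name: str         # identifies the training function that produced the model
    updateable: bool  # False for a static trained predictive model
    score: float      # effectiveness score; higher is assumed to be better


def regenerate_repository(old: List[StoredModel],
                          new: List[StoredModel],
                          min_score: Optional[float] = None) -> List[StoredModel]:
    """Replace static models; keep an old updateable model only if it scores higher."""
    old_by_name = {m.name: m for m in old}
    result = []
    for candidate in new:
        previous = old_by_name.get(candidate.name)
        if (candidate.updateable and previous is not None
                and previous.score > candidate.score):
            result.append(previous)   # old updateable model is more effective
        else:
            result.append(candidate)  # static models are always replaced
    if min_score is not None:
        # Optional threshold rule: store only models exceeding the threshold.
        result = [m for m in result if m.score > min_score]
    return result
```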

Although the process 800 was described in terms of the first condition being satisfied first to trigger an update of only the updateable trained predictive models followed by the second condition being satisfied to trigger an update of all of the trained predictive models, it should be understood that the steps of process 800 do not require the particular order shown. That is, determinations as to whether the first condition is satisfied and whether the second condition is satisfied can occur in parallel. In some instances, the second condition can be satisfied to trigger an update of all of the trained predictive models before the first condition has been satisfied. By way of illustrative example, the first condition may require that a threshold volume of new training data accumulate in the training data queue 213. The second condition may require that a certain predetermined period of time has expired. The period of time could expire before the threshold volume of new training data has been received. Accordingly, all of the trained predictive models in the repository 215 may be updated using updated training data, before the updateable trained predictive models were updated with the incremental new training data. Other scenarios are possible, and the above is but one illustrative example.
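
The illustrative example above, in which the two conditions are evaluated independently of one another, could be sketched as follows. The function name due_actions, the action labels, and the parameter names are hypothetical; the queue-volume and elapsed-time conditions are the ones given in the example.

```python
from typing import List


def due_actions(queue_size: int,
                seconds_since_regeneration: float,
                queue_threshold: int,
                regeneration_period: float) -> List[str]:
    """Evaluate both conditions independently and report which updates are due."""
    actions = []
    # Second condition: a predetermined period of time has expired.
    if seconds_since_regeneration >= regeneration_period:
        actions.append("regenerate_all")
    # First condition: a threshold volume of new data has accumulated.
    if queue_size >= queue_threshold:
        actions.append("retrain_updateable")
    return actions
```

With a queue of 10 items against a threshold of 100 and an expired one-hour period, only the regeneration is due, which corresponds to the scenario in which the second condition is satisfied before the first.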

Various implementations of the systems and techniques described here may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user may provide input to the computer. Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here may be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.

The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention.

In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other embodiments are within the scope of the following claims.

What is claimed is:
 1. A system comprising: one or more computers; and one or more storage devices coupled to the one or more computers and storing: a repository of training functions, a repository of trained predictive models comprising static trained predictive models and updateable trained predictive models, a training data queue, a training data repository, and instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: receiving a series of training data sets; adding the training data sets to the training data queue; in response to a first condition being satisfied, generating a plurality of retrained predictive models using the training data queue, a plurality of updateable trained predictive models obtained from the repository of trained predictive models, and a plurality of training functions obtained from the repository of training functions; updating the repository of trained predictive models by storing one or more of the plurality of generated retrained predictive models; in response to a second condition being satisfied, generating a plurality of new trained predictive models using the training data queue and at least some of the training data stored in the training data repository and using a plurality of training functions obtained from the repository of training functions, wherein the plurality of new trained predictive models comprise static trained predictive models and updateable trained predictive models; and updating the repository of trained predictive models by storing at least some of the plurality of new trained predictive models.
 2. The system of claim 1, wherein the series of training data sets are received incrementally.
 3. The system of claim 1, wherein the series of training data sets are received together in a batch.
 4. The system of claim 1, wherein the first condition is satisfied when a size of the training data queue is greater than or equal to a threshold size.
 5. The system of claim 1, wherein the first condition is satisfied in response to receiving a command to update the plurality of updateable trained predictive models included in the repository of trained predictive models.
 6. The system of claim 1, wherein the first condition is satisfied after a predetermined time period has expired.
 7. The system of claim 1, wherein the second condition is satisfied in response to receiving a command to update the static models and the updateable models included in the repository of trained predictive models.
 8. The system of claim 1, wherein the second condition is satisfied after a predetermined time period has expired.
 9. The system of claim 1, wherein the second condition is satisfied when a size of the training data queue is greater than or equal to a threshold size.
 10. The system of claim 1, further comprising: a user interface configured to receive user input specifying a data retention policy that defines rules for maintaining and deleting training data included in the training data repository.
 11. The system of claim 1, where the operations further comprise: generating updated training data that includes at least some of the training data from the training data queue and at least some of the training data from the training data repository; and updating the training data repository by storing the updated training data.
 12. The system of claim 11, wherein generating updated training data comprises implementing a data retention policy that defines rules for maintaining and deleting training data included in at least one of the training data queue or the training data repository.
 13. The system of claim 12, wherein the data retention policy includes a rule for deleting training data from the training data repository when the training data repository size reaches a predetermined size limit.
 14. The system of claim 1, wherein updating the repository of trained predictive models by storing one or more of the plurality of generated retrained predictive models comprises: for each of the plurality of retrained predictive models: comparing an effectiveness score of the retrained predictive model to an effectiveness score of the updateable trained predictive model from the predictive model repository that was used to generate the retrained predictive model; and based on the comparison, selecting a first of the two predictive models to store in the repository of predictive models and not storing a second of the two predictive models in the repository; wherein the effectiveness scores are each a score that represents an estimation of the effectiveness of the respective trained predictive model.
 15. A computer-implemented method comprising: receiving new training data; adding the new training data to a training data queue; determining whether a size of the training data queue is greater than a threshold size; when the training data queue size is greater than the threshold size, retrieving a stored plurality of trained predictive models and a stored training data set, wherein each of the trained predictive models was generated using the training data set and a plurality of training functions, and wherein each of the trained predictive models is associated with a score that represents an estimation of the effectiveness of the predictive model; generating a plurality of retrained predictive models using the training data queue, the retrieved plurality of trained predictive models and the plurality of training functions; generating a new score associated with each of the generated retrained predictive models; and adding at least some of the training data queue to the stored training data set.
 16. The method of claim 15, wherein the threshold is a predetermined data size.
 17. The method of claim 15, wherein the threshold is a predetermined ratio of the training data queue size to a size of the stored training data set.
 18. A computer-implemented method comprising: receiving a series of training data sets; adding the training data sets to a training data queue; in response to a first condition being satisfied, generating a plurality of retrained predictive models using the training data queue, a plurality of updateable trained predictive models obtained from a repository of trained predictive models, and a plurality of training functions obtained from a repository of training functions; updating the repository of trained predictive models by storing one or more of the plurality of generated retrained predictive models; in response to a second condition being satisfied, generating a plurality of new trained predictive models using the training data queue and at least some of training data stored in a training data repository and using a plurality of training functions obtained from the repository of training functions, wherein the plurality of new trained predictive models comprise static trained predictive models and updateable trained predictive models; and updating the repository of trained predictive models by storing at least some of the plurality of new trained predictive models.
 19. The method of claim 18, wherein the first condition is satisfied when a size of the training data queue is greater than or equal to a threshold size.
 20. The method of claim 18, wherein the second condition is satisfied when a predetermined period of time has expired.